    Density-based algorithms for active and anytime clustering

    Data-intensive fields such as biology, medicine, and neuroscience require effective and efficient data mining technologies. Advanced data acquisition methods produce data of constantly increasing volume and complexity. As a consequence, the need for new data mining technologies that can deal with complex data has emerged during the last decades. In this thesis, we focus on the data mining task of clustering, in which objects are separated into different groups (clusters) such that objects inside a cluster are more similar to each other than to objects in different clusters. In particular, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive benefits; for example, it can detect clusters of arbitrary shape and is robust to outliers. Thus, DBSCAN has attracted a lot of research interest during the last decades, with many extensions and applications. In the first part of this thesis, we aim to develop new algorithms based on the DBSCAN paradigm to deal with the new challenges of complex data, particularly expensive distance measures and incomplete availability of the distance matrix. Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), that works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines it as the run proceeds. 
Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm as soon as they are satisfied with the result, to save runtime, and (2) continuing the algorithm to achieve better results. This kind of anytime scheme has proven in the literature to be a very useful technique for time-consuming problems. We also introduce an extended version of A-DBSCAN, called A-DBSCAN-XS, which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix in order to run. For complex data, these distances are usually expensive, time-consuming, or even impossible to acquire because of high cost, high time complexity, or noisy and missing data. Motivated by these potential difficulties in acquiring the distances among objects, we propose another approach for DBSCAN, called Active Density-based Clustering (Act-DBSCAN). Given a budget limit B, Act-DBSCAN is allowed to use at most B pairwise distances and should ideally produce the same result as if it had the entire distance matrix at hand. The general idea of Act-DBSCAN is that it actively selects the most promising pairs of objects for distance calculation and tries to approximate the desired clustering result as closely as possible with each distance calculation. This scheme provides an efficient way to reduce the total cost of performing the clustering. It thus limits the potential weakness of DBSCAN when dealing with the distance sparseness problem of complex data. As a fundamental data clustering approach, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of white matter fiber tracts in the human brain acquired from Diffusion Tensor Imaging (DTI). 
We propose a model to evaluate the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measures from fields such as time-sequence mining are adapted to calculate the structural similarity of fibers. Density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS can be used as novel solutions for the segmentation of massive fiber datasets and how they provide unique features to assist experts during the fiber segmentation process.
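
The core idea stated in the abstract (an object belongs to a cluster only if enough objects lie in its neighborhood) can be illustrated with a minimal sketch of the standard batch DBSCAN. This is not A-DBSCAN, A-DBSCAN-XS, or Act-DBSCAN; the function name `dbscan`, the parameters `eps` and `min_pts`, and the convention that a point counts as its own neighbor are assumptions made for illustration. The first step computes every pairwise distance, which is the cost that the anytime and active variants aim to reduce.

```python
import math


def dbscan(points, eps, min_pts, dist=math.dist):
    """Plain batch DBSCAN (illustrative sketch).

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    A point is a core point if at least min_pts points (itself included,
    an assumed convention) lie within distance eps.
    """
    n = len(points)
    # Full neighborhood computation: needs all n * n distances up front.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # expand only through core points
    return labels


if __name__ == "__main__":
    # Hypothetical toy data: two tight groups and one isolated point.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

With an expensive `dist` function, the neighborhood step dominates the runtime, which is why producing an early approximate result or limiting the number of distance evaluations is attractive.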

    Smartphone indoor positioning based on enhanced BLE beacon multi-lateration

    In this paper, we introduce a smartphone indoor positioning method using Bluetooth Low Energy (BLE) beacon multilateration. First, based on signal strength analysis, we construct a distance calculation model for BLE beacons. Then, to improve positioning accuracy, we propose an improved lateration method (a range-based method) that is applied to four nearby beacons. The method is intended for real-time systems that support services such as emergency assistance, personal localization and tracking, and location-based advertising and marketing. Experimental results show that the proposed method achieves high accuracy compared with state-of-the-art lateration methods such as the geometry-based (conventional trilateration), least square estimation-based (LSE-based), and weighted LSE-based methods.
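
The two stages described above (a signal-strength-to-distance model, then a range-based position estimate) can be sketched as follows. This shows the widely used log-distance path-loss model and the plain LSE baseline, not the paper's improved method. The calibration values `tx_power` (RSSI at 1 m) and `n` (path-loss exponent), the function names, and the beacon layout are all assumptions for illustration.

```python
import math


def rssi_to_distance(rssi, tx_power=-59.0, n=2.0):
    """Log-distance path-loss model: estimated distance in metres.

    tx_power is the assumed RSSI at 1 m and n the assumed path-loss
    exponent; both would have to be calibrated for a real deployment.
    """
    return 10 ** ((tx_power - rssi) / (10.0 * n))


def multilaterate(beacons, distances):
    """Least-squares 2D position from three or more beacons.

    Subtracting the last beacon's circle equation from the others gives a
    linear system A p = b, solved here through its 2x2 normal equations.
    """
    (xn, yn), dn = beacons[-1], distances[-1]
    s_xx = s_xy = s_yy = s_xb = s_yb = 0.0
    for (xi, yi), di in zip(beacons[:-1], distances[:-1]):
        ax = 2.0 * (xi - xn)
        ay = 2.0 * (yi - yn)
        b = (xi ** 2 - xn ** 2) + (yi ** 2 - yn ** 2) - (di ** 2 - dn ** 2)
        s_xx += ax * ax
        s_xy += ax * ay
        s_yy += ay * ay
        s_xb += ax * b
        s_yb += ay * b
    det = s_xx * s_yy - s_xy * s_xy
    if det == 0.0:
        raise ValueError("beacons are collinear; position is not determined")
    x = (s_xb * s_yy - s_xy * s_yb) / det
    y = (s_xx * s_yb - s_xy * s_xb) / det
    return x, y


if __name__ == "__main__":
    # Hypothetical layout: four beacons at the corners of a 10 m x 10 m room.
    beacons = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    rssi_readings = [-73.0, -77.0, -76.0, -79.0]  # made-up measurements
    ranges = [rssi_to_distance(r) for r in rssi_readings]
    print(multilaterate(beacons, ranges))
```

Because real RSSI readings are noisy, the estimated ranges rarely intersect in a single point; the least-squares step returns the position that best fits all ranges at once.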

    iBeacon-based indoor positioning system: from theory to practical deployment

    Developing indoor positioning systems has become essential because Global Positioning System (GPS) signals do not work well in indoor environments. Mobile positioning can be accomplished with many radio frequency technologies, such as Bluetooth Low Energy (BLE), wireless fidelity (Wi-Fi), and ultra-wideband (UWB). Given the pressing need for indoor positioning systems, we present in this work a deployment scheme for smartphones using Bluetooth iBeacons. Three main parts, namely hardware deployment, software deployment, and positioning accuracy assessment, are discussed in detail to find the optimal solution for a complete indoor positioning system. Our application and experimental results show that the proposed solution is feasible and that an indoor positioning system is fully attainable.

    An Exploration and an Application of the Recruitment Criteria on Qualified Personnel by the Analytic Hierarchy Process Method at Logistics Enterprises in Vietnam

    This study explores criteria for recruiting qualified personnel in logistics companies in Vietnam and applies these criteria to a practical recruitment case using a multi-criteria decision-making method suggested by the authors. To explore the recruitment criteria, a sample of 224 logistics companies operating in Ho Chi Minh City, one of the largest cities in Vietnam, was surveyed. Cronbach's alpha and exploratory factor analysis (EFA) are used to test and build the measurement scales. In addition, multiple linear regression is used to identify the influences on the recruitment selection decision. The results show that four criteria affect the decision to recruit qualified personnel: skills, knowledge, health, and personality traits. These criteria are then applied to personnel selection, using the analytic hierarchy process (AHP) to select the best candidate. The paper offers practical help to industry practitioners in their recruitment activities.
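
To make the AHP step concrete, the sketch below derives criterion weights from a pairwise comparison matrix using the geometric-mean approximation and checks the consistency of the judgments with Saaty's consistency ratio. The four criteria names come from the abstract, but the comparison values are hypothetical and are not the study's data; the function names are also assumptions.

```python
import math

# Saaty's random consistency index for small matrix sizes.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Priority weights by the row geometric-mean approximation."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI; CR below 0.1 is usually accepted."""
    n = len(matrix)
    weighted = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lambda_max = sum(v / w for v, w in zip(weighted, weights)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


if __name__ == "__main__":
    criteria = ["skills", "knowledge", "health", "personality traits"]
    # Hypothetical judgments: entry [i][j] says how much more important
    # criterion i is than criterion j on Saaty's 1-9 scale.
    comparisons = [
        [1.0, 2.0, 4.0, 3.0],
        [1 / 2, 1.0, 3.0, 2.0],
        [1 / 4, 1 / 3, 1.0, 1 / 2],
        [1 / 3, 1 / 2, 2.0, 1.0],
    ]
    weights = ahp_weights(comparisons)
    for name, weight in zip(criteria, weights):
        print(f"{name}: {weight:.3f}")
    print("CR:", round(consistency_ratio(comparisons, weights), 3))
```

Candidates would then be compared pairwise under each criterion in the same way, and their per-criterion scores combined using these weights.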

    OMG U got flu? Analysis of shared health messages for bio-surveillance

    Background: Micro-blogging services such as Twitter offer the potential to crowdsource epidemics in real time. However, Twitter posts ('tweets') are often ambiguous and reactive to media trends. In order to ground user messages in epidemic response, we focused on tracking reports of self-protective behaviour, such as avoiding public gatherings or increased sanitation, as the basis for further risk analysis. Results: We created guidelines for tagging self-protective behaviour based on Jones and Salathé (2009)'s behaviour response survey. Applying the guidelines to a corpus of 5283 Twitter messages related to influenza-like illness showed a high level of inter-annotator agreement (kappa 0.86). We employed supervised learning using unigrams, bigrams, and regular expressions as features with two supervised classifiers (SVM and Naive Bayes) to classify tweets into 4 self-reported protective behaviour categories plus a self-reported diagnosis. In addition to classification performance, we report a moderately strong Spearman's rho correlation obtained by comparing classifier output against WHO/NREVSS laboratory data for A(H1N1) in the USA during the 2009-2010 influenza season. Conclusions: The study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data, pointing the way towards low-cost sensor networks. We believe that the signals we have modelled may be applicable to a wide range of diseases.
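
The comparison of classifier output with laboratory data relies on Spearman's rho, which is the Pearson correlation of the ranks of the two series. The sketch below shows that computation with average ranks for ties. The weekly counts are invented purely for illustration and are not the study's data; the function names are assumptions.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical weekly series: tweets flagged by a classifier versus
    # laboratory-confirmed cases for the same weeks.
    flagged_tweets = [12, 30, 45, 80, 60, 25]
    lab_positives = [3, 9, 20, 41, 38, 10]
    print(round(spearman_rho(flagged_tweets, lab_positives), 3))
```

Because only ranks are compared, the measure tolerates the very different scales of tweet counts and laboratory counts.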

    Current status and behavior modeling on household solid-waste separation: a case study in Da Nang city, Vietnam

    This study focused on household solid-waste recycling in Da Nang city, Vietnam, to assess existing separation behavior and clarify the factors that influence it. The authors conducted a questionnaire survey of 150 households in 6 urban districts; the questionnaire covered household attributes, separation behavior, and the household's attitude toward recycling and the environment. Waste separation rates were determined for leftover food and 13 recyclable items, and recyclable disposal habits were also assessed. The separation rate of leftover food was 77.3%. Among the 13 surveyed recyclable items, plastic bottles and metal cans were two popular items with higher separation rates (72.5% and 63.8%, respectively). To identify the conscious structure and determinants of separation behavior, the authors developed a predictive model of the separation behavior for leftover food and recyclables using logistic and multiple linear regression analyses. The positive factors included behavior intention, sympathy for the collector, the incentive brought by recycling, goal intention, internal norm, and perception of responsibility and seriousness. The negative factor was evaluation of trouble. The authors also analyzed the differences in separation rates among attributes. Based on the significant influencing factors and attributes, the authors suggested how to promote separation behavior.

    An Exploration in Social and Emotional Health of Vietnamese High School Students

    Social and emotional health (SEH) aims to promote academic success and create school well-being. SEH has not been studied in Vietnam. This article explores the SEH of Vietnamese high school students, a group in which a high level of mental health risk has been observed. The study used a qualitative case study approach, interviewing 74 students, 12 teachers, 7 school administrators, and 4 school counselors. We identified four features of how SEH is expressed among Vietnamese students: (1) confident but lacking individual perspectives, (2) respectful but lacking listening and empathy in school relationships, (3) balanced but lacking authentic perception of emotions and effective emotional management, and (4) satisfied but lacking sustainability and action. This study broadens our understanding of external behaviours and current limitations in young people's SEH, from their own perspectives, in a developing Southeast Asian country, with the aim of promoting positive psychological development in school-based prevention programs.

    Factors Influencing Consumer Behavior to Purchase Vegan Cosmetics in Vietnam

    The research examines the factors that significantly influence consumer behavior in purchasing vegan cosmetics in Vietnam, including Reference Group, Consumer Perception, Salesperson Attitude, Product Quality, Price, Place, Promotion, and Brand. Data were collected through a self-administered closed-ended questionnaire from a sample of 480 consumers in Vietnam. For the analysis, SPSS 22 was used to address validity concerns and to test the proposed relationships among the selected variables. The results reveal that Product Quality is the strongest influence on consumer behavior in purchasing vegan cosmetics, followed by Reference Group, Salesperson Attitude, Place, Price, Promotion, Brand, and Consumer Perception. This study provides a 'snapshot' for the government and cosmetics businesses of the determinants of consumer behavior in purchasing vegan cosmetics in Vietnam. Keywords: factors, consumer behaviour, vegan cosmetics DOI: 10.7176/JESD/14-6-03 Publication date: March 31st 202


    This article provides detailed information about the popular content and forms of extracurricular sports practiced by university students at a dormitory of Vietnam National University Ho Chi Minh City (VNUHCMC). The study uses document review, surveys, and mathematical statistics to investigate which extracurricular sports the students favor and how they organize their practice. The results indicate that the majority of students choose football, volleyball, badminton, athletics, and martial arts. They practice individually and/or in teams, without instructors, for 30 minutes to 2 hours, in the afternoon after school and/or in the morning, at the dormitory and/or at sports centers.
